Finance Data Capstone Project¶

In this project, I will be doing an exploratory data analysis of stock prices through data visualization. This is not meant to be an analysis for practical application, so while the observations here will be interesting, they are not to be taken as financial advice. We must always keep in mind that past performance does not guarantee future performance.

This is my second capstone data analysis project for the Udemy course "Python for Data Science and Machine Learning Bootcamp".

In this project, we're going to focus on bank stocks during the financial crisis from January 1st, 2006 through January 1st, 2016.

Getting the data¶

First we'll need to obtain the data. We'll be using pandas datareader in order to get financial data from the Web.

Importing¶

In [1]:
from pandas_datareader import data, wb
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
%matplotlib inline

import plotly
import cufflinks as cf
cf.go_offline()

The Data¶

We're going to get the stock information for 6 large banks during that time:

  • Bank of America
  • CitiGroup
  • Goldman Sachs
  • JPMorgan Chase
  • Morgan Stanley
  • Wells Fargo

We're going to use Quandl as the source of our data, and pandas datareader to create a dataframe for each bank over our period.

In [2]:
#Setting Jan 1st, 2006 through Jan 1st, 2016 as our dates for this analysis
start = datetime.datetime(2006,1,1)
end = datetime.datetime(2016,1,1)
In [3]:
#From the pandas_datareader documentation
import pandas_datareader.data as web
In [4]:
#From quandl documentation (API key redacted; substitute your own)
import quandl
quandl.ApiConfig.api_key = "YOUR_QUANDL_API_KEY"
In [5]:
quandl_api_key = "YOUR_QUANDL_API_KEY"
In [6]:
#Bank of America
BAC = web.DataReader('WIKI/BAC','quandl',start,end,access_key = quandl_api_key)
In [7]:
#Check that it worked
BAC.head()
Out[7]:
Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume
Date
2015-12-31 17.01 17.07 16.83 16.83 47106760.0 0.0 1.0 16.564256 16.622684 16.388973 16.388973 47106760.0
2015-12-30 17.20 17.24 17.04 17.05 35035518.0 0.0 1.0 16.749277 16.788229 16.593470 16.603208 35035518.0
2015-12-29 17.25 17.35 17.16 17.28 45628449.0 0.0 1.0 16.797967 16.895347 16.710326 16.827181 45628449.0
2015-12-28 17.22 17.23 16.98 17.13 41759993.0 0.0 1.0 16.768753 16.778491 16.535042 16.681112 41759993.0
2015-12-24 17.32 17.38 17.22 17.27 29373415.0 0.0 1.0 16.866133 16.924561 16.768753 16.817443 29373415.0
In [8]:
#CitiGroup
C = web.DataReader('WIKI/C','quandl',start,end,access_key = quandl_api_key)
In [9]:
#Goldman Sachs
GS = web.DataReader('WIKI/GS','quandl',start,end,access_key = quandl_api_key)
In [10]:
#JPMorgan Chase
JPM = web.DataReader('WIKI/JPM','quandl',start,end,access_key = quandl_api_key)
In [11]:
#Morgan Stanley
MS = web.DataReader('WIKI/MS','quandl',start,end,access_key = quandl_api_key)
In [12]:
#Wells Fargo
WFC = web.DataReader('WIKI/WFC','quandl',start,end,access_key = quandl_api_key)

Now that we've imported the data, let's combine the individual dataframes into one

In [13]:
tickers = [BAC,C,GS,JPM,MS,WFC]
tickers_str = ['BAC','C','GS','JPM','MS','WFC']
bank_stocks = pd.concat(tickers,axis=1,keys = tickers_str).sort_values(by='Date')
#note that we added keys to help track what data belongs to which bank

#adding column names to make it easier to read and understand
bank_stocks.columns.names = ['Bank Ticker','Stock Info']
In [14]:
bank_stocks.head()
Out[14]:
Bank Ticker BAC ... WFC
Stock Info Open High Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow ... Low Close Volume ExDividend SplitRatio AdjOpen AdjHigh AdjLow AdjClose AdjVolume
Date
2006-01-03 46.92 47.18 46.15 47.08 16296700.0 0.0 1.0 37.126108 37.331837 36.516835 ... 62.39 63.80 5508200.0 0.0 1.0 22.799832 23.070400 22.507619 23.016287 11016400.0
2006-01-04 47.00 47.24 46.45 46.58 17757900.0 0.0 1.0 37.189409 37.379312 36.754214 ... 62.73 63.06 5435000.0 0.0 1.0 22.944135 22.958566 22.630277 22.749327 10870000.0
2006-01-05 46.58 46.83 46.32 46.64 14970700.0 0.0 1.0 36.857078 37.054894 36.651349 ... 62.62 62.99 5079000.0 0.0 1.0 22.727681 22.767364 22.590594 22.724074 10158000.0
2006-01-06 46.80 46.91 46.35 46.57 12599800.0 0.0 1.0 37.031156 37.118195 36.675087 ... 62.77 63.36 4201900.0 0.0 1.0 22.785402 22.926097 22.644707 22.857554 8403800.0
2006-01-09 46.72 46.97 46.36 46.60 15619400.0 0.0 1.0 36.967855 37.165671 36.683000 ... 63.11 63.35 2809800.0 0.0 1.0 22.853946 22.962173 22.767364 22.853946 5619600.0

5 rows × 72 columns

Exploratory Data Analysis¶

The dataframe looks to be in place, so let's begin our exploration.

We'll start by finding the max close price for each bank's stock during this period.

In [15]:
bank_stocks.unstack().xs('Close',level=1).groupby('Bank Ticker').max()
Out[15]:
Bank Ticker
BAC     54.90
C       60.34
GS     247.92
JPM     70.08
MS      89.30
WFC     73.00
dtype: float64

From our data, it looks like the highest close over this period was Goldman Sachs at 247.92.
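The same maxima can also be read off without unstacking, by taking a cross-section of the columns. A minimal sketch on a toy frame (the numbers are made up) that mirrors bank_stocks' two-level column layout:

```python
import pandas as pd

# Toy two-level frame mirroring bank_stocks' (ticker, field) column layout
cols = pd.MultiIndex.from_product([['BAC', 'GS'], ['Open', 'Close']],
                                  names=['Bank Ticker', 'Stock Info'])
demo = pd.DataFrame([[46.9, 47.1, 158.0, 159.2],
                     [47.0, 46.6, 160.1, 161.0]], columns=cols)

# xs pulls every 'Close' column across tickers; max() then works per ticker
print(demo.xs('Close', level='Stock Info', axis=1).max())
```

Either route gives one maximum per ticker; the cross-section version just skips the reshape.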

When looking at stocks, though, we're also interested in the return (the percent change) of the stocks. Let's make a dataframe that represents this for each bank.

In [16]:
returns = bank_stocks.pct_change().xs('Close', level = 'Stock Info', axis=1)
returns.columns = ["BAC Return", "C Return", "GS Return", "JPM Return", "MS Return", "WFC Return"]
returns.columns.names = ['Bank']
returns.head()
Out[16]:
Bank BAC Return C Return GS Return JPM Return MS Return WFC Return
Date
2006-01-03 NaN NaN NaN NaN NaN NaN
2006-01-04 -0.010620 -0.018462 -0.013812 -0.014183 0.000686 -0.011599
2006-01-05 0.001288 0.004961 -0.000393 0.003029 0.002742 -0.001110
2006-01-06 -0.001501 0.000000 0.014169 0.007046 0.001025 0.005874
2006-01-09 0.000644 -0.004731 0.012030 0.016242 0.010586 -0.000158

Now let's look at this data together using a seaborn pairplot to quickly plot the returns against each other and get distribution plots. We should see the points fall along a roughly positive linear trend: the higher the return of one bank, the higher the return of another, since they share the same industry - unless, of course, one significantly outperforms the rest.

In [17]:
sns.pairplot(returns)
Out[17]:
<seaborn.axisgrid.PairGrid at 0x2393333dc50>

For the most part, we see what we expected - a positive linear trend in the data points. However, Citigroup looks much different from the rest of these graphs, which seems to be due to an extreme outlier. We also notice that the axes differ from plot to plot, so we can't compare banks to each other as easily as we should be able to. Let's fix that with another plot.

In [18]:
#We'll be using a Pair Grid for more flexibility.
g = sns.PairGrid(returns.dropna())
g.map_diag(plt.hist)
g.map_upper(sns.scatterplot)
g.map_lower(sns.scatterplot)

for ax in g.axes.flat:
    ax.set_yticks([-.5,0,.5])
    ax.set_xticks([-.5,0,.5])

While this didn't help our understanding of Citigroup, it makes the other banks a little easier to compare. Points that form no clear shape, or that spread in a purely vertical or horizontal direction, imply little correlation; points that form a diagonal from bottom left to top right imply stronger correlation. A good example of a correlated relationship is JPM and WFC, whose points lie very close to a diagonal line. An example of a weaker correlation is MS and WFC, whose points have no real form or shape.

Let's look into this correlation using heat and cluster maps to give us more understanding of how these banks are connected.

In [19]:
sns.heatmap(bank_stocks.xs('Close',level=1,axis=1).corr(),annot=True,cmap='coolwarm')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x2393336f6a0>
In [20]:
sns.clustermap(bank_stocks.xs('Close',level=1,axis=1).corr(), cmap='coolwarm', annot=True)
Out[20]:
<seaborn.matrix.ClusterGrid at 0x23935728da0>

We see that the bank pairs we identified earlier as likely correlated, such as JPM and WFC, are indeed more correlated than the pairs we identified as uncorrelated, such as MS and WFC.

Simply by looking at these relationships visually, we can at least get a hint of what they are like and determine where to look for further confirmation and investigation, ultimately leading, once those correlations are confirmed, to insights.

For example, one could say that, directionally, Morgan Stanley (MS) and Bank of America (BAC) tend to move together. You could then hypothesize that MS and BAC would have similar returns.

To see if this hypothesis is true, let's make another heatmap of the returns data to see if it's correlated.

In [21]:
#returns data correlated heat map
sns.heatmap(returns.dropna().corr(), annot = True, cmap='coolwarm')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x239385eef28>

We see that the returns for MS and BAC aren't as correlated as we expected. Let's explore why this could be the case.

In [22]:
#maybe the average magnitude of the stock prices are big enough to skew this
bank_stocks.xs('Close',level=1, axis=1).agg('mean')
Out[22]:
Bank Ticker
BAC     22.174575
C       33.001709
GS     158.481289
JPM     45.985346
MS      36.006091
WFC     37.615326
dtype: float64

As we can see, the average stock price for Morgan Stanley (MS, about 36) is roughly 60% higher than that of Bank of America (BAC, about 22). However, price level alone can't fully explain the weaker return correlation: percent returns are scale-free, so a constant difference in price magnitude cancels out of the calculation. A more likely explanation is that closing-price correlations are inflated by the long-run trend these stocks share (both fell hard in 2008 and recovered afterward), while daily returns capture short-term co-movement, which is noisier and therefore less correlated.
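As a sanity check on this line of reasoning, here's a quick sketch (with hypothetical prices) of how pct_change responds when the same price path is scaled by a constant:

```python
import numpy as np
import pandas as pd

# Two hypothetical price series: the same path at very different price levels
low_priced = pd.Series([20.0, 21.0, 19.5, 20.5])
high_priced = low_priced * 8  # identical moves at 8x the price

# pct_change divides each move by the prior price, so the level cancels out
same = np.allclose(low_priced.pct_change().dropna(),
                   high_priced.pct_change().dropna())
print(same)
```

The percent returns come out identical, which tells us price level by itself doesn't change returns - differences in return correlation must come from the price paths themselves.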

Let's continue exploring the returns dataframe that we made and see what we can find. Let's find the single best and worst day returns for each bank.

In [23]:
#min single day return % date
returns.dropna().idxmin()
Out[23]:
Bank
BAC Return   2009-01-20
C Return     2009-02-27
GS Return    2009-01-20
JPM Return   2009-01-20
MS Return    2008-10-09
WFC Return   2006-08-14
dtype: datetime64[ns]

We see that 3 of the banks had their lowest single-day return on the same date. Investigating what happened on January 20th, 2009, we find that it was the day Barack Obama was sworn in as President of the United States. It seems that Wall Street reacted to this change in political climate.

In [24]:
#max single day return % date
returns.dropna().idxmax()
Out[24]:
Bank
BAC Return   2009-04-09
C Return     2011-05-09
GS Return    2008-11-24
JPM Return   2009-01-21
MS Return    2008-10-13
WFC Return   2008-07-16
dtype: datetime64[ns]

With the max returns, we notice that JPMorgan (JPM) essentially bounced back the day after Barack Obama's inauguration. This means we can't read too much into stock moves around the inauguration itself; we can only say that stocks are shaky when there is significant political change.

What we're really interested in, though, is why Citigroup had such a high return datapoint (seen earlier in the scatterplots). Upon further investigation, we find that Citigroup executed a 1-for-10 reverse stock split that took effect on May 9th, 2011. The reverse split combined every 10 shares into one while preserving each holder's total value, so the quoted price per share jumped roughly 10x overnight. Because the raw closing prices we used aren't adjusted for that event, they show a spurious one-day return of roughly 900%, which is why Citigroup has such a large anomaly in its returns graph.
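A minimal sketch (the prices are made up) of how an unadjusted price series produces that artifact, and how rescaling by the split ratio removes it:

```python
import pandas as pd

# Hypothetical unadjusted closes around a 1-for-10 reverse split:
# the quoted price jumps ~10x overnight though holders' value is unchanged
unadjusted = pd.Series([4.50, 4.52, 45.20, 45.00])
print(unadjusted.pct_change().iloc[2])  # a spurious ~900% one-day "return"

# Rescaling the pre-split prices by the split ratio removes the artifact
split_ratio = 10
adjusted = unadjusted.copy()
adjusted.iloc[:2] = adjusted.iloc[:2] * split_ratio
print(adjusted.pct_change().iloc[2])  # ~0, a normal-sized move
```

This is exactly what the AdjClose column in our Quandl data does for us: it folds split and dividend adjustments back into the price history.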

Now let's take a look at the standard deviation (also known as volatility) of the bank returns over the time period.

In [25]:
returns.dropna().std()
Out[25]:
Bank
BAC Return    0.036628
C Return      0.179066
GS Return     0.025358
JPM Return    0.027651
MS Return     0.037820
WFC Return    0.031838
dtype: float64

It seems that, over the 10-year period, Citigroup had the most volatility and could be seen as the riskiest of these stocks. However, much of this difference is most likely due to the outlier we found previously. Let's look at the returns for individual years to see how the banks compare in general.

In [26]:
#2007
returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2007].std()
Out[26]:
Bank
BAC Return    0.013765
C Return      0.018235
GS Return     0.021539
JPM Return    0.016850
MS Return     0.024872
WFC Return    0.016979
dtype: float64
In [27]:
#2008, the year of the financial crisis
returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2008].std()
Out[27]:
Bank
BAC Return    0.062934
C Return      0.073523
GS Return     0.050015
JPM Return    0.053134
MS Return     0.087604
WFC Return    0.052382
dtype: float64
In [28]:
#2015
returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2015].std()
Out[28]:
Bank
BAC Return    0.016163
C Return      0.015289
GS Return     0.014046
JPM Return    0.014017
MS Return     0.016249
WFC Return    0.012591
dtype: float64

We see that, while Citigroup (C) has the highest standard deviation over the whole period, Morgan Stanley (MS) has a higher standard deviation for 2008. Since Citigroup's full-period standard deviation is over 4 times that of MS, yet its within-year volatility is comparable, we can make the case that the full-period figure is inflated by the anomaly caused by the reverse stock split.
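As an aside, the repeated year-filtering pattern above can be collapsed into a single groupby. A sketch on a toy frame (the return values are made up), assuming a DatetimeIndex like our returns frame's:

```python
import pandas as pd

# Toy stand-in for the returns frame, with a DatetimeIndex
idx = pd.to_datetime(['2007-06-01', '2007-06-04', '2008-09-15', '2008-09-16'])
demo = pd.DataFrame({'BAC Return': [0.01, -0.02, -0.09, 0.12]}, index=idx)

# Group by calendar year instead of filtering one year at a time
yearly_std = demo.groupby(demo.index.year).std()
print(yearly_std)
```

This yields one row per year in a single call, which also makes it easy to plot the volatility trend over time.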

Now let's look at the distribution of returns for Citigroup (C) and Morgan Stanley (MS), the two most volatile banks over the time period, for 2007, 2008, and 2015. So that we can compare across years, we'll set the same x-axis limits on every graph.

In [29]:
#2007
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
#MS
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2007]['MS Return'], bins=100,color = 'green', ax=axes[0])
#C
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2007]['C Return'], bins=100,color = 'blue', ax=axes[1])
axes[0].get_shared_x_axes().join(axes[0], axes[1])
axes[0].get_shared_y_axes().join(axes[0], axes[1])
axes[0].set_xlim(-.2,.2)
fig.suptitle('2007 Returns Distribution')
Out[29]:
Text(0.5, 0.98, '2007 Returns Distribution')
In [30]:
#2008 
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
#MS
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2008]['MS Return'], bins=100,color = 'green',ax=axes[0])
#C
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2008]['C Return'], bins=100,color = 'blue', ax=axes[1])
axes[0].get_shared_x_axes().join(axes[0], axes[1])
axes[0].get_shared_y_axes().join(axes[0], axes[1])
axes[0].set_xlim(-.2,.2)
fig.suptitle('2008 Returns Distribution')
Out[30]:
Text(0.5, 0.98, '2008 Returns Distribution')
In [31]:
#2015 
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14,5))
#MS
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2015]['MS Return'], bins=100,color = 'green', ax=axes[0])
#C
sns.distplot(returns.dropna().reset_index()[returns.dropna().reset_index()['Date'].dt.year==2015]['C Return'], bins=100,color = 'blue', ax=axes[1])
axes[0].get_shared_x_axes().join(axes[0], axes[1])
axes[0].get_shared_y_axes().join(axes[0], axes[1])
axes[0].set_xlim(-.2,.2)
fig.suptitle('2015 Returns Distribution')
Out[31]:
Text(0.5, 0.98, '2015 Returns Distribution')

We see that, as expected, the 2008 distribution is much wider due to the volatility that occurred during the financial crisis.
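The axis-linking above can also be done directly when the figure is created, via sharex/sharey; a minimal sketch with made-up return samples (ms_like and c_like are hypothetical, not our data):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical return samples, just to have something to histogram
rng = np.random.default_rng(0)
ms_like = rng.normal(0, 0.09, 250)  # wider, 2008-style spread
c_like = rng.normal(0, 0.02, 250)   # calmer spread

# sharex/sharey links the panels at creation time, replacing the
# get_shared_x_axes().join(...) calls used in the cells above
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 5),
                         sharex=True, sharey=True)
axes[0].hist(ms_like, bins=50, color='green')
axes[1].hist(c_like, bins=50, color='blue')
axes[0].set_xlim(-.2, .2)  # one call now rescales both panels
fig.suptitle('Shared-axis Returns Distribution (sketch)')
```

Linking at creation keeps the two panels in lockstep automatically, so a single set_xlim call is enough.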

Let's look at the close price of all of these stocks over this period - in other words, let's see how each stock moved over the ten years

In [32]:
bank_stocks.xs('Close',level=1, axis=1).iplot()

As expected, 2008 is when a lot of the downside happened. Let's look at some rolling averages for a couple of these banks.

In [33]:
#to make it clear and easier to follow, we'll make a new dataframe for each graph we do
bank_stocks_closing08=bank_stocks.xs('Close',level=1, axis=1).reset_index()[bank_stocks.xs('Close',level=1, axis=1).reset_index()['Date'].dt.year==2008]
In [34]:
#MS
fig, axes = plt.subplots(figsize = (12,3))
axes.plot(bank_stocks_closing08.set_index('Date').rolling(window=30).mean()['MS'],color='blue', label='30 Day Average')
axes.plot(bank_stocks_closing08.set_index('Date')['MS'],color='green', label='MS')
axes.set_ylim(0,55)
axes.set_title('MS 2008')
plt.legend(loc='upper right')
Out[34]:
<matplotlib.legend.Legend at 0x23939837cf8>
In [35]:
#C
fig, axes = plt.subplots(figsize = (12,3))
axes.plot(bank_stocks_closing08.set_index('Date').rolling(window=30).mean()['C'],color='orange',label='30 Day Average')
axes.plot(bank_stocks_closing08.set_index('Date')['C'],color='blue',label='C')
axes.set_ylim(0,35)
axes.set_title('C 2008')
plt.legend(loc='upper right')
Out[35]:
<matplotlib.legend.Legend at 0x23939637e48>
In [36]:
#GS
fig, axes = plt.subplots(figsize = (12,3))
axes.plot(bank_stocks_closing08.set_index('Date').rolling(window=30).mean()['GS'],color='black',label='30 Day Average')
axes.plot(bank_stocks_closing08.set_index('Date')['GS'],color='red',label='GS')
axes.set_ylim(0,225)
axes.set_title('GS 2008')
plt.legend(loc='upper right')
Out[36]:
<matplotlib.legend.Legend at 0x239392f0cc0>

We see very similar shapes with regards to each of the moving averages for these three banks.

Technical Analysis Plots¶

Let's use some of cufflinks' advanced features to make technical analysis plots for the most recent year in the dataset, 2015.

In [53]:
#Candlestick chart
bank_stocks['MS'].reset_index()[bank_stocks['MS'].reset_index()['Date'].dt.year==2015].set_index('Date').iplot(kind='candle')

Candlestick charts are among the most common charts stock traders use to assess a stock's moves.

In [55]:
#Simple Moving Averages
bank_stocks['MS'].reset_index()[bank_stocks['MS'].reset_index()['Date'].dt.year==2015].set_index('Date')['Close'].ta_plot(study='sma',periods=[13,21,55])

We can see that cufflinks, through plotly, makes it very easy to plot multiple moving averages at once.

In [56]:
#Bollinger Bands
bank_stocks['MS'].reset_index()[bank_stocks['MS'].reset_index()['Date'].dt.year==2015].set_index('Date')['Close'].ta_plot(study='boll', periods=12)

The Bollinger Bands plot above packs in a lot of information, including the bands two standard deviations above and below a moving average, which act as soft "limits" on the stock's range. This makes it easy to see both the moving average and the volatility of the stock at a glance.
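Under the hood, this study is just a rolling mean flanked by rolling standard deviations. A minimal sketch (the function name and toy close series are my own, not part of cufflinks):

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, window: int = 12,
                    n_std: float = 2.0) -> pd.DataFrame:
    """Rolling mean plus bands n_std rolling standard deviations away."""
    mid = close.rolling(window).mean()
    sd = close.rolling(window).std()
    return pd.DataFrame({'lower': mid - n_std * sd,
                         'middle': mid,
                         'upper': mid + n_std * sd})

# Toy close series (made-up numbers) just to exercise the function
close = pd.Series(35 + np.sin(np.arange(60) / 5.0))
bands = bollinger_bands(close)
print(bands.tail(3))
```

The first window-1 rows are NaN because the rolling window isn't full yet; cufflinks' ta_plot handles the same warm-up period internally.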

Conclusion¶

By analyzing this dataset, we saw that correlation in closing prices doesn't imply the same correlation in percent returns. We also visualized the 2008 downturn for these bank stocks and observed the increased volatility they endured. Finally, we found that Citigroup's reverse stock split created an anomaly that skewed some of our standard deviation results; if we wanted to do more analysis, we could account for that anomaly, for example by working from the split-adjusted (AdjClose) prices or normalizing the data by market capitalization.

Being able to pull stock data from the web through Quandl and analyze it with pandas in Python, all with very little code, is incredibly powerful for running your own detailed analysis.
